METHODS FOR TIME-ALIGNMENT OF LIQUID CHROMATOGRAPHY-MASS
SPECTROMETRY DATA

FIELD OF THE INVENTION

[0001] The present invention relates generally to analysis of data collected by
analytical techniques such as chromatography and spectrometry. More particularly, it
relates to methods for time-aligning multi-dimensional chromatograms of different
samples to enable automated comparison among sample data.

BACKGROUND OF THE INVENTION

[0002] The high sensitivity and resolution of liquid chromatography-mass
spectrometry (LC-MS) make it an ideal tool for comprehensive analysis of complex
biological samples. Comparing spectra obtained from samples corresponding to
different patient cohorts (e.g., diseased versus non-diseased, or drug responders versus
non-responders) or subjected to different stimuli (e.g., drug administration regimens)
can yield valuable information about sample components correlated with particular
conditions. Such components may serve as biological markers that enable earlier and
more precise diagnosis, patient stratification, or prediction of clinical outcomes. They
may also guide the discovery of suitable and novel drug targets. Because this
approach extracts a large amount of information from a very small sample size,
automated data collection and analysis methods are desirable.

[0003] LC-MS data are reported as intensity or abundance of ions of varying
mass-to-charge ratio (m/z) at varying chromatographic retention times. A
two-dimensional spectrum of LC-MS data from a single sample is shown in FIG. 1, in
which the darkness of points corresponds to signal intensity. A horizontal slice of the
spectrum yields a mass chromatogram, the abundance of ions in a particular m/z range
as a function of retention time. A vertical slice is a mass spectrum, a plot of
abundance of ions of varying m/z at a particular retention time interval. The
two-dimensional data are acquired by performing a mass scan at regular intervals of
retention time. Summing the mass spectrum at each retention time yields a total ion
chromatogram (TIC), the abundance of all ions as a function of retention time. Local
maxima in intensity (with respect to both retention time and m/z) are referred to as
peaks. In general, peaks may span several retention time scan intervals and m/z
values.
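
By way of illustration only, the horizontal slice, vertical slice, and summation described above can be sketched in Python. The storage layout (rows as m/z bins, columns as retention-time scans), the function names, and the intensity values are assumptions made for this sketch and are not taken from the specification.

```python
# Illustrative only: a tiny LC-MS data set stored with rows = m/z bins and
# columns = retention-time scans. All intensity values are made up.
intensity = [
    [0.0, 2.0, 5.0, 1.0],   # m/z bin 0
    [1.0, 0.0, 3.0, 0.0],   # m/z bin 1
    [0.0, 4.0, 0.0, 2.0],   # m/z bin 2
]


def mass_chromatogram(data, mz_index):
    """Horizontal slice: abundance in one m/z bin versus retention time."""
    return list(data[mz_index])


def mass_spectrum(data, scan_index):
    """Vertical slice: abundance versus m/z at one retention-time scan."""
    return [row[scan_index] for row in data]


def total_ion_chromatogram(data):
    """Sum the mass spectrum at each scan to get total abundance versus time."""
    n_scans = len(data[0])
    return [sum(row[s] for row in data) for s in range(n_scans)]
```

In this layout a mass chromatogram is a row, a mass spectrum is a column, and the TIC is the column-wise sum.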

[0004] One significant obstacle for automated analysis of LC-MS data is the
nonlinear variability of chromatographic retention times, which can exceed the width
of peaks along the retention time axis substantially. This variability arises from, for
example, changes in column chemistry over time, instrument drift, interactions among
sample components, protein modifications, and minor changes in mobile phase
composition. While constant time offsets can be corrected for easily, nonlinear
variations are more problematic and significantly hamper the recognition of
corresponding peaks across sample spectra. This problem is illustrated by the
chromatograms of FIG. 2, in which the dotted and solid curves represent total ion
chromatograms of samples from two different patients. While it can be assumed that
the dotted curve has been time-shifted from the solid curve, it is difficult to predict
from the two curves to which of the two solid peaks the dotted peak corresponds.

[0005] Various methods have been provided in the art for addressing the problem
of chromatographic retention time shifts, including correlation, curve fitting, and
dynamic programming methods such as dynamic time warping and correlation
optimized warping. For example, a time warping algorithm is applied to gas
chromatography/Fourier transform infrared (FT-IR)/mass spectrometry data from a
gasoline sample in C.P. Wang and T.L. Isenhour, "Time-warping algorithm applied to
chromatographic peak matching gas chromatography/Fourier transform infrared/mass
spectrometry," Anal. Chem. 59: 649-654, 1987. In this method, a single FT-IR
interferogram is aligned with a TIC. While this method may be effective for simple
samples, it may be inadequate for more complex samples such as biological fluids,
which can contain thousands of different proteins and peptides, yielding thousands of
potentially relevant and, more importantly, densely spaced (in both m/z and retention
time) peaks.

[0006] There is still a need, therefore, for a robust method for time-aligning
chromatographic-mass spectrometric data.

BRIEF DESCRIPTION OF THE FIGURES

[0007] FIG. 1 (prior art) shows a sample two-dimensional liquid
chromatography-mass spectrometry (LC-MS) data set.

[0008] FIG. 2 is a schematic diagram of portions of total ion chromatograms of
two different samples, illustrating the difficulties in properly time-aligning
spectra.

[0009] FIG. 3 is a flow diagram of one embodiment of the present invention, a
method for comparing samples.

[0010] FIGS. 4A-4B illustrate aspects of a dynamic time warping (DTW) method
according to one embodiment of the present invention.

[0011] FIG. 5 shows a grid of chromatographic time points, used in DTW, with
an optimal route through the grid indicated.

[0012] FIGS. 6A-6B illustrate two constraints on a DTW method according to
one embodiment of the present invention.

[0013] FIGS. 7A-7C illustrate aspects of a locally-weighted regression smoothing
method according to one embodiment of the present invention.

[0014] FIGS. 8A-8B show corresponding peaks of one reference and three test
LC-MS data sets before and after time-alignment by DTW.

[0015] FIG. 9 is a plot showing results of alignment of LC-MS data sets by robust
LOESS and DTW.

DETAILED DESCRIPTION OF THE INVENTION

[0016] Various embodiments of the present invention provide methods for
time-aligning two-dimensional chromatography-mass spectrometry data sets, such as
liquid chromatography-mass spectrometry (LC-MS) data sets, also referred to as spectra.
These data sets can have nonlinear variations in retention time, so that corresponding
peaks (i.e., peaks representing the same analyte) in different samples elute from the
chromatographic column at different times. Additional embodiments provide
methods for comparing samples and data sets, methods for identifying biological
markers (biomarkers), aligned spectra produced according to these methods, samples
compared according to these methods, biomarkers identified according to these
methods, and methods for using the identified biomarkers for diagnostic and
therapeutic applications.

[0017] The methods are effective at aligning two-dimensional data sets obtained
from both simple and complex samples. Although complex and simple are relative
terms and are not intended to limit the scope of the present invention in any way,
complex samples typically have many more and more densely spaced spectral peaks
than do simple samples. For example, complex samples such as biological samples
may have upwards of hundreds or thousands of peaks in sixty minutes of retention
time, such that the total ion chromatogram (TIC) is too complex to allow resolution of
individual features. Rather than use composite one-dimensional data such as the TIC,
the methods in embodiments of the present invention use data from individual mass
chromatograms, i.e., data representing abundances or intensities of ions in particular
m/z ranges at particular retention times. The m/z range included within a single mass
chromatogram may reflect the instrument precision or may be the result of
preprocessing (e.g., binning) of the raw data, and is typically on the order of between
about 0.1 and 1.0 atomic mass unit (amu). Mass scans typically occur at intervals of
between about one and about three seconds.

[0018] In some embodiments of the present invention, computations are referred
to as being performed "in dependence on at least two mass chromatograms from each
data set." This phrase is to be understood as referring to computations on individual
data from a mass chromatogram, rather than to data summed over a number of
chromatograms.

[0019] While embodiments of the invention are described below with reference to
chromatography and mass spectrometry, and particularly to liquid chromatography, it
will be apparent to one of skill in the art how to apply the methods to any other
hyphenated chromatographic technique. For example, the second dimension may be
any type of electromagnetic spectroscopy such as microwave, far infrared, infrared,
Raman or resonance Raman, visible, ultraviolet, far ultraviolet, vacuum ultraviolet,
x-ray, or ultraviolet fluorescence or phosphorescence; any magnetic resonance
spectroscopy, such as nuclear magnetic resonance (NMR) or electron paramagnetic
resonance (EPR); and any type of mass spectrometry, including ionization methods
such as electron impact, chemical, thermospray, electrospray, matrix assisted laser
desorption, and inductively coupled plasma ionization, and any detection methods,
including sector, quadrupole, ion trap, time of flight, and Fourier transform detection.

[0020] Time-alignment methods are applied to data sets acquired by performing
chromatographic and spectrometric or spectroscopic methods on chemical or
biological samples. The samples can be in any homogeneous or heterogeneous form
that is compatible with the chromatographic instrument, for example, one or more of a
gas, liquid, solid, gel, or liquid crystal. Biological samples that can be analyzed by
embodiments of the present invention include, without limitation, whole organisms;
parts of organisms (e.g., tissue samples); tissue homogenates, extracts, infusions,
suspensions, excretions, secretions, or emissions; administered and recovered
material; and culture supernatants. Examples of biological fluids include, without
limitation, whole blood, blood plasma, blood serum, urine, bile, cerebrospinal fluid,
milk, saliva, mucus, sweat, gastric juice, pancreatic juice, seminal fluid, prostatic
fluid, sputum, bronchoalveolar lavage, and synovial fluid, and any cell suspensions,
extracts, or concentrates of these fluids. Non-biological samples include air, water,
liquids from manufacturing wastes or processes, foods, and the like. Samples may be
correlated with particular subjects, cohorts, conditions, time points, or any other
suitable descriptor or category.

[0021] FIG. 3 is a flow diagram of a general method 20 according to one
embodiment of the present invention. The method is typically implemented in
software by a computer system in communication with an analytical instrument such
as a liquid chromatography-mass spectrometry (LC-MS) instrument. In a first step
22, raw data sets are obtained, e.g., from the instrument, from a different computer
system, or from a data storage device. The data sets, which are also referred to as
spectra or two-dimensional data sets or spectra, contain intensity values for discrete
values (or ranges of values) of chromatographic retention time (or scan index) and
mass-to-charge ratio (m/z). At each scan time of the instrument, an entire mass
spectrum is obtained, and the collection of mass spectra for the chromatographic run
of that sample makes up the data set. Typically, a collection of data sets is acquired
from a large number (i.e., more than two) of samples before subsequent processing
occurs.

[0022] In an optional next step 24, the data sets are preprocessed using
conventional algorithms. Examples of preprocessing techniques applied include,
without limitation, baseline subtraction, smoothing, noise reduction, de-isotoping,
normalization, and peak list creation. Additionally, the data can be binned into
defined m/z intervals to create mass chromatograms. Data are collected at discrete
scan times, but m/z values in the mass spectra are typically of very high mass
precision. In order to create mass chromatograms, data falling within a specified m/z
interval (e.g., 0.5 amu) are combined into a composite value for that interval. Any
suitable binning algorithm may be employed; as is known in the art, the selection of a
binning algorithm and its parameters may have implications for data smoothness,
fidelity, and quality.
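
One possible binning scheme is sketched below in Python, purely as an example. It assumes fixed-width bins and uses summation as the composite value; the function names `bin_scan` and `build_mass_chromatograms`, the `mz_min` origin, and the choice of summation are assumptions of this sketch, and any other suitable binning algorithm could be substituted.

```python
import math


def bin_scan(peaks, bin_width=0.5, mz_min=0.0):
    """Combine the (m/z, intensity) pairs of one scan into fixed-width bins.

    Returns a dict mapping bin index -> summed intensity. Summation is only
    one possible composite value; a maximum or a mean could be used instead.
    """
    binned = {}
    for mz, inten in peaks:
        index = math.floor((mz - mz_min) / bin_width)
        binned[index] = binned.get(index, 0.0) + inten
    return binned


def build_mass_chromatograms(scans, bin_width=0.5, mz_min=0.0):
    """Bin every scan, then collect each bin's values across all scans.

    Returns a dict mapping bin index -> list of intensities, one per scan.
    """
    binned_scans = [bin_scan(s, bin_width, mz_min) for s in scans]
    indices = sorted({i for b in binned_scans for i in b})
    return {i: [b.get(i, 0.0) for b in binned_scans] for i in indices}
```

Each value in the returned dictionary is one mass chromatogram, i.e., the binned intensity in a single m/z interval as a function of scan index.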

[0023] In step 26, a time-aligning algorithm is applied to one or more pairs of data
sets. One data set can be chosen (arbitrarily or according to a criterion) to serve as a
reference spectrum and all other data sets time-aligned to this spectrum. For example,
assuming the samples are analyzed on the instrument consecutively, the reference data
set can correspond to the sample analyzed in the middle of the process. Alternatively,
a feedback method can be implemented in which the degree of time shift is measured
for each data set, potentially with respect to one or more of the data sets chosen
arbitrarily as a reference data set, and the one with a median time shift, according to
some metric, selected as the reference data set. Data sets can also be evaluated by a
perceived or actual quality metric to determine which to select as the reference data
set.

[0024] After the data sets are aligned to a common retention time scale, the
aligned data sets can be compared automatically in step 28 to locate features that
differentiate the spectra. For example, a peak that occurs in only certain spectra or at
significantly different intensity levels in different spectra may represent a biological
marker or a component of a biological marker that is indicative of or diagnostic for a
characteristic of the relevant samples (e.g., disease, response to therapy, patient group,
disease progression). If desired, the identity of the ions responsible for the
distinguishing features can be determined. Biological markers may also be more
complex combinations of spectral features or sample components with or without
other clinical or biological factors. Identifying spectral differences and biological
markers is a multi-step process and will not be described in detail herein. For more
information, see U.S. Patent Application No. 09/994,576, "Methods for Efficiently
Mining Broad Data Sets for Biological Markers," filed 11/27/2001, which is
incorporated herein by reference. In general, this step 28 is referred to as differential
phenotyping, because differences among phenotypes, as represented by the
comprehensive (rather than selective) LC-MS spectrum of expressed proteins and
small molecules, are detected.

[0025] Step 26, time-aligning pairs of spectra, can be implemented in many
different ways. In one embodiment of the invention, spectra are aligned using a
variation of a dynamic time warping (DTW) method. DTW is a dynamic
programming technique that was developed in the field of speech recognition for
time-aligning speech patterns and is described in H. Sakoe and S. Chiba, "Dynamic
programming algorithm optimization for spoken word recognition," IEEE Trans.
Acoust., Speech, Signal Process. ASSP-26: 43-49, 1978, which is incorporated herein
by reference.

[0026] In embodiments of the present invention, DTW aligns two data sets by
nonlinearly stretching and contracting ("warping") the time component of the data
sets to synchronize spectral features and yield a minimum distance between the two
spectra. In asymmetric DTW, a test data set is warped to align with a reference data
set. Alternatively, in symmetric DTW, both data sets are adjusted to fit a common
time index. The following description is of asymmetric warping, but it will be
apparent to one of ordinary skill in the art, upon reading this description, how to
perform the analogous symmetric warping.

[0027] FIG. 4A is a plot of two chromatograms, labeled test and reference, whose
time scales are nonlinearly related. That is, peaks representing identical analytes,
referred to as corresponding peaks (and the corresponding points that make up these
peaks), occur at different retention times, and there is no linear transformation of time
components that will map corresponding peaks to the same retention times. Although
the data are shown as continuous curves, each data set consists of discrete values (an
entire mass spectrum) at a sequence of time indices; for clarity, only a single intensity
value, rather than an entire mass spectrum, is shown at each time point. In the figure,
corresponding points are connected by dashed lines, which represent a mapping of
time points in the reference data set to time points in the test data set. This mapping is
shown more explicitly in the table of FIG. 4B. The object of a DTW algorithm is to
identify this time point mapping, from which an aligned reference data set may be
constructed. Note that DTW aligns the entire data set, and not just peaks of the data
set, and that DTW yields a discrete time point mapping, rather than a function that
transforms the original time points into aligned time points. As a result, some points
(reference and test) do not get mapped, and unmapped points can be handled as
described below.

[0028] Conceptually, the DTW method considers a set of possible time point
mappings and identifies the mapping that minimizes an accumulated distance function
between the reference and test data sets. Consider the grid in FIG. 5, in which rows
correspond to $I$ time indices $i$ in the test data set and columns to $J$ time indices $j$ in the
reference data set ($I$ and $J$ can be different). Each possible time point mapping can be
represented as a route $c(k)$ through this grid, where $c(k) = [i(k), j(k)]$ and $1 \le k \le K$.
For example, if the test and reference data sets were perfectly aligned, the route would
be a diagonal beginning in the upper left cell and proceeding to the lower right cell of
the grid. The selected route represents the optimal time point mapping.

[0029] The set of possible routes is limited by three types of constraints: endpoint
constraints; a local continuity constraint, which defines local features of the path; and
a global constraint, which defines the allowable search space for the path. The
endpoint constraint equates the first and last time point in each data set. In the grid,
the upper left and lower right cells are fixed as the start and end of the path,
respectively, i.e., $c(1) = [1, 1]$ and $c(K) = [I, J]$. The local continuity constraint forces
the path to be monotonic with a non-negative slope, meaning that, for a path $c(k) =
[i(k), j(k)]$, $i(k+1) \ge i(k)$ and $j(k+1) \ge j(k)$. This condition maintains the order of time
points. An upper bound can also be placed on the slope to prevent excessive
compression or expansion of time scales. The result of these conditions is that the
path to an individual cell is limited to one of the three illustrated in FIG. 6A. Finally,
the global constraint limits the path to a specified number of grid places from the
diagonal, illustrated schematically in FIG. 6B. This latter constraint confines the
solution to one that is physically realizable while also substantially limiting the
computation time.

[0030] The optimal path through the grid is one that minimizes the accumulated
distance function between the test and reference data sets over the route. Each cell
$[i, j]$ has an associated distance function between data sets at the particular $i$ and $j$ time
indices. The distance function can take a variety of different forms. If only a single
chromatogram (e.g., the TIC) were considered, the distance function $d_{i,j}$ between
points $t_i^{\mathrm{test}}$ and $t_j^{\mathrm{ref}}$ would be:

$$d_{i,j} = \left( I_j^{\mathrm{ref}} - I_i^{\mathrm{test}} \right)^2, \qquad (1)$$

where $I_j^{\mathrm{ref}}$ is the $j$th intensity value of the reference spectrum and $I_i^{\mathrm{test}}$ is the $i$th intensity
value of the test spectrum. In embodiments of the present invention, however, $M$
mass chromatograms of each data set are considered in computing the distance
function, where $M \ge 2$, and so, in one embodiment, the distance function is:

$$d_{i,j} = \sum_{k=1}^{M} \left( I_{jk}^{\mathrm{ref}} - I_{ik}^{\mathrm{test}} \right)^2, \qquad (2)$$

where $I_{jk}^{\mathrm{ref}}$ is the $j$th intensity value of the $k$th reference chromatogram and $I_{ik}^{\mathrm{test}}$ is the
$i$th intensity value of the $k$th test chromatogram. Both $k$th chromatograms are for a
single m/z range. Each cell of the grid in FIG. 5 is filled with the appropriate value of
the distance function, and a route is chosen through the matrix that minimizes the
accumulated distance function obtained by summing the values in each cell traversed,
subject to the above-described constraints. Note that the two terms distance and route
are not related; the distance refers to a metric of the dissimilarity between data sets,
while the route refers to a path through the grid and has no relevant distance.
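
As a minimal sketch, the multi-chromatogram distance function can be written in Python as follows. The sketch assumes the distance is a sum of squared intensity differences over the M chromatograms; the function names and the list-of-lists data layout are assumptions made for illustration, and other distance forms are equally possible.

```python
def distance(test, ref, i, j):
    """Distance between test time index i and reference time index j.

    test and ref are lists of M chromatograms covering the same m/z ranges
    in the same order; each chromatogram is a list of intensities indexed by
    time. Assumed form: sum over chromatograms of squared differences.
    """
    if len(test) != len(ref):
        raise ValueError("test and ref need the same number of chromatograms")
    return sum((r[j] - t[i]) ** 2 for t, r in zip(test, ref))


def distance_grid(test, ref):
    """Fill the grid: rows are test time indices, columns reference indices."""
    n_test, n_ref = len(test[0]), len(ref[0])
    return [[distance(test, ref, i, j) for j in range(n_ref)]
            for i in range(n_test)]
```

The grid returned by `distance_grid` plays the role of the cell values of FIG. 5; the test and reference data sets may have different numbers of time points.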

[0031] The route-finding problem can be addressed using a dynamic
programming approach, in which the larger optimization problem is reduced to a
series of local problems. At each allowable cell in the grid (FIG. 6B), the optimal
one of the three (FIG. 6A) single-step paths is identified. After all cells have been
considered, a globally optimal route is reconstructed by stepping backwards through
the grid from the last cell. For more information on dynamic programming, see T.H.
Cormen et al., Introduction to Algorithms (2nd ed.), Cambridge: MIT Press (2001),
which is incorporated herein by reference.

[0032] Locally optimal paths are selected by minimizing the accumulated distance
from the initial cell to the current cell. For the three potential single-step paths to the
cell $[i, j]$, the accumulated distances are:

$$D_{i,j}^{(1)} = D_{i-1,j-2} + 2 d_{i,j-1} + d_{i,j}$$
$$D_{i,j}^{(2)} = D_{i-1,j-1} + 2 d_{i,j}$$
$$D_{i,j}^{(3)} = D_{i-2,j-1} + 2 d_{i-1,j} + d_{i,j} \qquad (3)$$

where $D_{i,j}^{(p)}$ represents the accumulated distance from $[1, 1]$ to $[i, j]$ when path $p$ is
traversed, $d_{i,j}$ is computed from equation (2), and $D_{i-1,j-1}$, $D_{i-1,j-2}$, and $D_{i-2,j-1}$ are
evaluated in previous steps. The coefficient 2 is a weighting factor that inclines the
path to follow the diagonal. It may take on other values as desired. The minimized
accumulated distance for the cell $[i, j]$ is given by:

$$D_{i,j} = \min_{p} \left( D_{i,j}^{(p)} \right). \qquad (4)$$

This value is stored in an accumulated distance matrix for use in subsequent
calculations, and the selected value of $p$ is stored in an index matrix.

[0033] The dynamic programming algorithm proceeds by stepping through each
cell and finding and storing the minimum accumulated distances and optimal indices.
Typically the process begins at the top left cell of the grid and moves down through
all allowed cells before moving to the next column, with the allowable cells in each
column defined by the global search space. After the final cell has been computed,
the optimal route is found by traversing the grid backwards to the starting cell [1, 1]
based on optimal paths stored in the index matrix. Note that the route cannot be
constructed in the forward direction, because it is not known until subsequent
calculations whether the current cell will lie on the optimal route. Once the optimal
route has been determined, an aligned test data set can be constructed.
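
The column-by-column fill and backward trace described above can be sketched in Python as shown below. This is a sketch under stated assumptions rather than a definitive implementation: it assumes three single-step paths with predecessors [i-1, j-2], [i-1, j-1], and [i-2, j-1] and a weighting coefficient of 2, it uses zero-based indices, it starts the accumulated distance at the value of the first cell, and it applies a simple `band` on |i - j| as the global constraint. The names `STEPS`, `dtw_route`, and `band` are hypothetical.

```python
import math

# Each step is (di, dj, mid): the predecessor is [i - di, j - dj], and mid is
# the offset (subtracted from [i, j]) of the intermediate cell, if any.
# Assumed pattern: it bounds the local slope between 1/2 and 2.
STEPS = (
    (1, 2, (0, 1)),   # path 1: from [i-1, j-2] via [i, j-1]
    (1, 1, None),     # path 2: diagonal step from [i-1, j-1]
    (2, 1, (1, 0)),   # path 3: from [i-2, j-1] via [i-1, j]
)


def dtw_route(d, band=None):
    """Return (route, total) for distance grid d (rows: test, cols: reference)."""
    n_i, n_j = len(d), len(d[0])
    inf = math.inf
    acc = [[inf] * n_j for _ in range(n_i)]     # accumulated distance matrix
    back = [[None] * n_j for _ in range(n_i)]   # index matrix (chosen path p)
    acc[0][0] = d[0][0]                         # endpoint constraint: start cell
    for j in range(n_j):                        # column by column
        for i in range(n_i):                    # top to bottom within a column
            if i == 0 and j == 0:
                continue
            if band is not None and abs(i - j) > band:
                continue                        # cell outside the search band
            for p, (di, dj, mid) in enumerate(STEPS):
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or acc[pi][pj] == inf:
                    continue                    # predecessor not reachable
                cost = acc[pi][pj] + d[i][j]
                if mid is None:
                    cost += d[i][j]             # diagonal step: 2 * d[i][j]
                else:
                    cost += 2 * d[i - mid[0]][j - mid[1]]
                if cost < acc[i][j]:
                    acc[i][j], back[i][j] = cost, p
    if acc[-1][-1] == inf:
        raise ValueError("no admissible route reaches the final cell")
    # Trace backwards from the final cell using the index matrix.
    route, i, j = [], n_i - 1, n_j - 1
    while (i, j) != (0, 0):
        route.append((i, j))
        di, dj, mid = STEPS[back[i][j]]
        if mid is not None:
            route.append((i - mid[0], j - mid[1]))
        i, j = i - di, j - dj
    route.append((0, 0))
    route.reverse()
    return route, acc[-1][-1]
```

For two identical data sets the returned route is the grid diagonal. The band test is applied only to the end cell of each step, so an intermediate cell may fall one grid place outside the band.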

[0034] Unless the test and reference data sets are perfectly aligned, there are
points in both sets that do not get mapped. When the test time scale is compressed,
some intermediate test points do not get mapped. These points are discarded. When
the test time scale is expanded, there are reference time points for which no
corresponding test point exists. Values of the points can be estimated, e.g., by
linearly interpolating between intensity values of surrounding points that have been
mapped to reference points.
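
The handling of unmapped points can be illustrated with the following Python sketch for a single chromatogram. The `mapping` dictionary (reference index to test index), the function name, and the choice to hold the nearest mapped value at the two ends of the run are assumptions made for this sketch only.

```python
def align_chromatogram(test, mapping, n_ref):
    """Build an aligned chromatogram of length n_ref from a test chromatogram.

    mapping: dict {reference index j: test index i} for mapped points only.
    Unmapped test points are never read (i.e., they are discarded). Unmapped
    reference points are filled by linear interpolation between the nearest
    mapped neighbours on either side.
    """
    mapped = sorted(mapping)
    if not mapped:
        raise ValueError("mapping is empty")
    aligned = [None] * n_ref
    for j in mapped:
        aligned[j] = test[mapping[j]]
    for j in range(n_ref):
        if aligned[j] is not None:
            continue
        left = max((m for m in mapped if m < j), default=None)
        right = min((m for m in mapped if m > j), default=None)
        if left is None:          # before the first mapped point: hold value
            aligned[j] = test[mapping[right]]
        elif right is None:       # after the last mapped point: hold value
            aligned[j] = test[mapping[left]]
        else:
            frac = (j - left) / (right - left)
            lo, hi = test[mapping[left]], test[mapping[right]]
            aligned[j] = lo + frac * (hi - lo)
    return aligned
```

Applying the same mapping to every mass chromatogram of the test data set yields the aligned test data set.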

[0035] The above-described methods and steps can be varied in many ways
without departing from the scope of the invention. For example, alternative
constraints can be applied to the route (e.g., different allowable local slopes, end
points not fixed but rather constrained to allowable regions, different global search
space), and alternative distance functions can be employed. The weighting factors for
local paths can be varied from the value 2 used in equations (3). Additionally, a
normalization factor can be included in the distance function. The distance function
above is based on intensity, but, depending on how the data set is represented, can be
based on any other coefficient of features of the data set. For example, the function
can be computed from coefficients of wavelets, peaks, or derivatives by which the
data set is represented. In this case, the distance is a measure of the degree of
alignment of these features.

[0036] In the equations above, the distance function is computed based on data
from M individual mass chromatograms. Any value of M is within the scope of the
present invention, as are any selection criteria by which chromatograms are selected
for inclusion. Reducing the number of chromatograms from the total number in the
data set (e.g., 2000) to M can decrease the computation time substantially.
Additionally, excluding noisy chromatograms or those without peaks can improve the
alignment accuracy. There is generally an optimal range of M that balances alignment
accuracy and computation time, and it is beneficial to choose a value of M in the
lower end of the range, i.e., a value that minimizes computation time without
sacrificing substantially the accuracy of time-alignment. It is also beneficial to
include chromatograms containing peaks throughout the range of retention time; this
is particularly important near the beginning and end of the chromatographic run, when
there are fewer peaks. In one embodiment, between about 200 and about 400
chromatograms are used. Alternatively, between about 200 and about 300
chromatograms are used. In another embodiment, M is about 200.

[0037] A variety of selection criteria can be applied individually or jointly to
select the chromatograms with which the distance function is computed. The
selection criteria or their parameters (e.g., intensity thresholds) can be predetermined,
computed at run time, or selected by a user. M can be a selected value (manually or
automatically) or the result of applying the criterion or criteria (i.e., M chromatograms
happen to fit the criteria).

[0038] One selection criterion is that a mass chromatogram have peaks in both the
reference and test data sets, as determined by a manual or automated peak selection
algorithm. Peak selection algorithms typically apply an intensity threshold and
identify local maxima exceeding the threshold as peaks. The peaks may or may not
be required to be corresponding (in m/z and retention time) for the chromatogram to
meet the criterion. If corresponding peaks are required, a relatively large window in
retention time is applied to account for the to-be-corrected retention time shifts.
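
A simple threshold-based peak selection and the resulting selection criterion might be sketched in Python as follows. The sketch looks for local maxima along the retention time axis of a single chromatogram only, does not require the peaks to be corresponding, and uses hypothetical function names; it is one of many possible peak selection algorithms.

```python
def find_peaks(chrom, threshold):
    """Indices of local maxima whose intensity exceeds the threshold."""
    peaks = []
    for t in range(1, len(chrom) - 1):
        is_local_max = chrom[t - 1] < chrom[t] >= chrom[t + 1]
        if chrom[t] > threshold and is_local_max:
            peaks.append(t)
    return peaks


def has_peaks_in_both(ref_chrom, test_chrom, threshold):
    """True if the chromatogram has at least one peak in each data set."""
    return (bool(find_peaks(ref_chrom, threshold))
            and bool(find_peaks(test_chrom, threshold)))


def select_chromatograms(ref, test, threshold):
    """Indices k of mass chromatograms that meet the criterion.

    ref and test are lists of chromatograms for the same m/z ranges, in the
    same order; the two data sets may have different numbers of time points.
    """
    return [k for k, (r, t) in enumerate(zip(ref, test))
            if has_peaks_in_both(r, t, threshold)]
```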

[0039] Another selection criterion is that maximum, median, or average intensity
values in a mass chromatogram exceed a specified intensity threshold, or that a single
peak intensity or maximum, median, or average peak intensity values in the
chromatogram exceed an intensity threshold. Alternatively, at least one individual
peak intensity or the maximum, median, or average peak intensity can be required to
be between upper and lower intensity level thresholds. Another selection criterion is
that the number of peaks in a mass chromatogram exceed a threshold value. These
criteria are typically applicable to both the reference and test mass chromatograms.

[0040] When the selection criterion involves an intensity threshold, the threshold
can be constant or vary with retention time to accommodate variations in mean or
median signal intensity throughout a chromatographic run. Often, the beginning and
end of the run yield fewer and lower intensity peaks than occur in the middle of the
run, and lower thresholds may be suitable for th



[0041] According to an alternative selection criterion, a set of the most orthogonal
chromatograms is selected, i.e., the set that provides the most information. When an
analyte is present in chromatograms of adjacent m/z values, these chromatograms
may be redundant, providing no more information than is provided by a single
chromatogram. Standard correlation methods can be applied to select orthogonal
chromatograms. The orthogonal chromatograms are selected to span the elution time
range, so that just enough information is provided to align the data sets accurately
throughout the entire range. In this case, the selection criterion contains an
orthogonality metric and a retention time range.

[0042] Individual selection criteria may be combined in many different ways. For
example, in one composite selection criterion, peaks are first selected in the reference
and test data sets using any suitable manual or automatic peak selection method.
Next, a filter is applied separately to the two data sets to yield two subsets of peaks.
This filter can be a single threshold or two (upper and lower) thresholds. A lower
threshold ensures that peaks are above the noise level, while an upper threshold
excludes falsely elevated values reflecting a saturated instrument detector.
Corresponding peaks are then selected that appear in both the test and reference peak
subsets. Chromatograms corresponding to these peaks are included in computing the
distance function. Alternatively, from the list of corresponding peaks, M
chromatograms are chosen randomly. For example, if N corresponding peaks are
found, the chromatograms corresponding to every (N/M)th m/z value are selected.
Alternatively, the M chromatograms can be selected from the corresponding peaks
based on an intensity threshold or some other criterion.
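
The composite criterion can be illustrated by the following Python sketch. Peaks are represented here as (m/z, retention time, intensity) tuples, and the tolerance parameters `mz_tol` and `rt_window`, the function names, and the integer step used for the every-(N/M)th selection are all assumptions made for illustration.

```python
def filter_peaks(peaks, lower, upper):
    """Keep peaks above the noise level and below detector saturation."""
    return [p for p in peaks if lower <= p[2] <= upper]


def corresponding_mz(ref_peaks, test_peaks, mz_tol, rt_window):
    """Sorted m/z values of reference peaks that have a nearby test peak.

    A test peak is 'nearby' if it lies within mz_tol in m/z and within
    rt_window in retention time of the reference peak.
    """
    found = set()
    for mz_r, rt_r, _ in ref_peaks:
        for mz_t, rt_t, _ in test_peaks:
            if abs(mz_r - mz_t) <= mz_tol and abs(rt_r - rt_t) <= rt_window:
                found.add(mz_r)
                break
    return sorted(found)


def every_nth(values, m):
    """Pick at most m roughly evenly spaced entries: every (N/M)th value."""
    if m <= 0:
        raise ValueError("m must be positive")
    step = max(1, len(values) // m)
    return values[::step][:m]
```

A typical use would chain the three functions: filter each peak list, find the corresponding m/z values, and then reduce that list to M entries.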

[0043] When more than one test data set is aligned to the reference data set, each 
pairwise alignment can be computed based on a different set of independently- 
selected chromatograms. 

[0044] In one embodiment of the invention, a weighting factor $W_k$ is included in
the distance function, causing different chromatograms to contribute unequally. As a
result, certain chromatograms tend to dominate the sum and dictate the alignment.
The weighted distance function is:

$$d_{i,j} = \sum_{k=1}^{M} W_k \left( I_{jk}^{\mathrm{ref}} - I_{ik}^{\mathrm{test}} \right)^2, \qquad (5)$$

where $W_k$ is the chromatogram-dependent weighting factor. The functional form or
value of the weighting factor can be determined a priori based on user knowledge of
the most relevant mass ranges. Alternatively, the weighting factor can be computed
based on characteristics of the data. For example, the weighting factor can be a
function of one or more of the following variables: the number of peaks per
chromatogram (peak number), selected by any manual or automatic method; the
signal-to-noise ratio in a chromatogram; and peak threshold or intensities.
Chromatograms having more peaks, higher signal-to-noise ratio, or higher peak
intensities are typically weighted more than other chromatograms. Any additional
variables can be included in the weighting factor. The factor can also depend on a
combination of user knowledge and data values.
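
A weighted distance of this kind can be sketched in Python as follows. The sketch assumes a weighted sum of squared intensity differences and, as one hypothetical choice of weighting factor, weights proportional to the peak number of each chromatogram; neither the function names nor the normalization of the weights is taken from the specification.

```python
def weighted_distance(test, ref, weights, i, j):
    """Weighted distance between test index i and reference index j.

    test, ref: lists of M chromatograms (lists of intensities over time).
    weights: list of M chromatogram-dependent weighting factors.
    Assumed form: sum over k of weights[k] * squared intensity difference.
    """
    if not (len(test) == len(ref) == len(weights)):
        raise ValueError("test, ref, and weights must all have length M")
    return sum(w * (r[j] - t[i]) ** 2
               for w, t, r in zip(weights, test, ref))


def peak_count_weights(peak_counts):
    """Hypothetical weighting: proportional to peak number, summing to 1."""
    total = sum(peak_counts)
    if total == 0:
        raise ValueError("at least one chromatogram must contain a peak")
    return [c / total for c in peak_counts]
```

Signal-to-noise ratio or peak intensity could be folded into the weights in the same way.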

[0045] In an alternative embodiment of the invention, the time-aligning step 26
employs locally-weighted regression smoothing. Rather than act on the raw (or
preprocessed) data, this method time-aligns selected peaks in test and reference data
sets. Peaks, defined by m/z and retention time values, are first selected from each
data set by manual or automatic means. Potentially corresponding peaks are
identified from the lists as peaks that fall within a specified range of m/z and retention
time values. FIG. 7A shows an excerpt of a reference peak list and test peak list with
potentially corresponding peaks shaded. These peaks are plotted in FIG. 7B, which
shows the window surrounding the reference peak that defines a region of potentially
corresponding peaks. Because the nonlinear time variations have not yet been
corrected, the window has a relatively large retention time range, accounting for the
maximum retention time variation throughout the chromatographic run (e.g., five
minutes).
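The identification of potentially corresponding peaks can be sketched as a simple
window test. This is a minimal illustration, not the disclosed implementation: the
function name candidate_pairs and the m/z tolerance of 0.5 are assumptions, while
the five-minute retention time window follows the example given above.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, retention time in minutes)

def candidate_pairs(ref_peaks: List[Peak], test_peaks: List[Peak],
                    mz_tol: float = 0.5, rt_tol: float = 5.0):
    """Return every (reference peak, test peak) pair inside the m/z and time window."""
    pairs = []
    for ref_mz, ref_rt in ref_peaks:
        for test_mz, test_rt in test_peaks:
            # A test peak is a candidate when it lies within both tolerances.
            if abs(test_mz - ref_mz) <= mz_tol and abs(test_rt - ref_rt) <= rt_tol:
                pairs.append(((ref_mz, ref_rt), (test_mz, test_rt)))
    return pairs

ref_peaks = [(500.0, 10.0), (700.0, 40.0)]
test_peaks = [(500.2, 12.5), (500.1, 30.0), (700.3, 43.0), (900.0, 40.0)]
pairs = candidate_pairs(ref_peaks, test_peaks)
```

A reference peak may match several test peaks, and some of these matches will be
wrong; the smoothing step is expected to absorb such spurious pairs.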

WO 03/095978 PCT/US03/14729

[0046] For every pair of reference peak and potentially corresponding test peak,
the data are transformed from ($t_{ref}$, $t_{test}$) to ($t_{avg}$, $\Delta t$), where
$t_{avg} = (t_{ref} + t_{test})/2$ and $\Delta t = t_{ref} - t_{test}$. The resulting plot, for
exemplary data sets, is shown in FIG. 7C. It is apparent from FIG. 7C that the points
tend to cluster around a curve that represents the nonlinear time variation between
reference and test data sets. Knowing this curve would enable correction of the time
variation and alignment of the data sets. To do so, a smoothing algorithm is applied
to the transformed variables to yield a set of discrete smoothed values
($t_{avg}$, $\Delta t$), which can be transformed back to ($t_{ref}$, $t_{test}$). Because the
smoothing is applied to data points representing peaks, and because the result is a
discrete mapping of points rather than a function, adjusted time values of data points
between the peaks are then computed, e.g., by interpolation. After all points have
been mapped, aligned data sets can be constructed. Typically, time points of the
reference data set are fixed and the test data set modified. This process can be
repeated to align all data sets to the reference data set.
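The change of variables and its inverse can be sketched as follows. The function
names to_avg_delta and from_avg_delta are hypothetical; the inverse follows
algebraically from the two definitions of the transformed variables.

```python
import numpy as np

def to_avg_delta(t_ref, t_test):
    """Map (t_ref, t_test) to (t_avg, delta_t)."""
    t_ref = np.asarray(t_ref, dtype=float)
    t_test = np.asarray(t_test, dtype=float)
    return (t_ref + t_test) / 2.0, t_ref - t_test

def from_avg_delta(t_avg, delta_t):
    """Inverse map: t_ref = t_avg + delta_t/2 and t_test = t_avg - delta_t/2."""
    t_avg = np.asarray(t_avg, dtype=float)
    delta_t = np.asarray(delta_t, dtype=float)
    return t_avg + delta_t / 2.0, t_avg - delta_t / 2.0

t_avg, delta_t = to_avg_delta([10.0, 20.0], [12.0, 19.0])
t_ref_back, t_test_back = from_avg_delta(t_avg, delta_t)
```

In practice the smoothed time differences, rather than the raw ones, would be
passed to the inverse map.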



[0047] One suitable smoothing algorithm is a LOESS algorithm (locally weighted
scatterplot smooth), originally proposed in W.S. Cleveland, "Robust locally weighted
regression and smoothing scatterplots," J. Am. Stat. Assoc. 74: 829-836, 1979, and
further developed in W.S. Cleveland and S.J. Devlin, "Locally weighted regression:
an approach to regression analysis by local fitting," J. Am. Stat. Assoc. 83: 596-610,
1988, both of which are incorporated herein by reference. A LOESS function
(sometimes called LOWESS) is available in many commercial mathematics and
statistics software packages such as S-PLUS®, SAS, Mathematica, and MATLAB®.

[0048] The LOESS method, described in more detail below, fits a polynomial
locally to points in a window centered on a given point to be smoothed. Both the
window size ("span") and polynomial degree must be selected. The span is typically
specified as a percentage of the total number of points. In standard LOESS, a
polynomial is fit to the span by weighting points in the window based on their
distance from the point to be smoothed. After fitting the polynomial, the smoothed
point is replaced by the computed point, and the method proceeds to the next point,
recalculating weights and fitting a new polynomial. Each time, even though the entire
span is fit by the polynomial, only the center point is adjusted. Because the method
operates locally, it is quite effective at representing the fine nonlinear variations in
chromatographic retention time.

[0049] A robust version of LOESS, which is more resistant to outliers, computes
the smoothed points in an iterative fashion by continuing to modify the weights until
convergence (or based on a selected number of iterations). The iterative corrections
are based on the residuals between the polynomial fit and the raw data points. After
the points are fit using initial weights, subsequent weights are computed as the
products of the initial weights and the new weights. Upon convergence, the span is
moved by one point and the entire process repeated. In this manner, the polynomial
regression weights are based on both the distance from the point to be smoothed
(distance in abscissa value) and the distance between the point and the curve fit
(distance in ordinate value), yielding a very robust fit.

[0050] Specific details of the robust LOESS fit are described below. It is to be
understood that any variations in parameters, weighting factors, and polynomial
degree are within the scope of the present invention. Each discrete ($t_{avg}$, $\Delta t$)
point is represented in the formulae below as ($x_i$, $y_i$). The approximated value of
$y_i$ computed from the polynomial fit is represented as $\hat{y}_i$.

[0051] First, a window size is chosen and centered on the point to be smoothed, $x$.
Suitable window sizes are between about 10% and about 50% (e.g., about 30%) of the
total span of $x_i$ values. The results may be sensitive to the span, and the optimal span
depends on a number of factors, including the threshold by which peaks are selected.
For example, if the peak selection threshold is low, yielding a large number of densely
located points, the optimal span size may be larger than if the peak selection threshold
were to yield fewer, less dense points. The span can also be selected by performing
the smoothing using a few different spans and selecting the one that yields the best
alignment according to a fit metric, a measure of how well the smoothed values fit the
apparent alignment function or of how much the $\Delta t$ value varies locally or globally
across the retention time range. The smoothing can also be evaluated based on
knowledge of the expected result. The $N$ points within the chosen span are fit to a
weighted polynomial of degree $L$ (typically, $L = 2$) by minimizing the regression merit
function, $\chi^2$:

$$\chi^2 = \sum_{i=1}^{N} w_i \left[ y_i - \sum_{k=0}^{L} a_k x_i^k \right]^2, \qquad (5)$$

where $a_k$ are the polynomial coefficients to be solved for and $w_i$ are the regression
weights for each point $x_i$ in the span. Initially, the weights $w_i$ are given by a tricubic
function:

$$w_i^{initial} = \left( 1 - \left| \frac{x - x_i}{x - x_{max}} \right|^3 \right)^3, \qquad (6)$$

where $x$ is the point being smoothed, $x_i$ are the individual points within the span, and
$x_{max}$ is the point farthest from $x$. The weights vary smoothly from 0 for the point
farthest from the smoothed point to 1 for the smoothed point. All weights are zero for
points outside the span. The regression merit function in equation (5) is minimized to
determine the polynomial coefficients $a_k$. For standard LOESS, the smoothed value $\hat{y}$
is computed from the polynomial, and the span is moved one point to the right to
smooth the next point.
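A single step of standard LOESS, as just described, can be sketched with NumPy.
This is an illustrative sketch under stated assumptions: the span is taken as a
fraction of the number of points (nearest neighbours of the point being smoothed),
the names tricube_weights and loess_point are hypothetical, and np.polyfit is used
for the weighted fit. Because np.polyfit applies its weights to the unsquared
residuals, the square root of the regression weights is passed.

```python
import numpy as np

def tricube_weights(x0, xs):
    """Tricube weights: 1 at x0, falling to 0 at the farthest point in the span."""
    xs = np.asarray(xs, dtype=float)
    d = np.abs(xs - x0)
    d_max = d.max()
    if d_max == 0.0:
        return np.ones_like(xs)
    return (1.0 - (d / d_max) ** 3) ** 3

def loess_point(x, y, i, span=0.3, degree=2):
    """Smoothed value at x[i] from a weighted polynomial fit over the span."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Number of points in the span (assumed: fraction of all points, with a
    # floor so that enough points carry nonzero weight for the fit).
    k = min(n, max(degree + 2, int(np.ceil(span * n))))
    idx = np.argsort(np.abs(x - x[i]))[:k]
    w = tricube_weights(x[i], x[idx])
    # sqrt(w) makes polyfit minimize sum(w * residual**2).
    coeffs = np.polyfit(x[idx], y[idx], degree, w=np.sqrt(w))
    return float(np.polyval(coeffs, x[i]))

x = np.linspace(0.0, 10.0, 21)
y = 0.5 * x ** 2 - x + 2.0          # exact quadratic, so the fit reproduces it
smoothed_mid = loess_point(x, y, 10)
```

Calling loess_point for every index in turn, always on the raw y values, gives the
full standard LOESS smooth.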

[0052] For robust LOESS, these results are used to compute the robust weights
based on the residual $r_i$ between the raw data value $y_i$ and polynomial value $\hat{y}_i$ for each
point in the span:

$$r_i = y_i - \hat{y}_i \qquad (7)$$

and on the median absolute deviation MAD:

$$\mathrm{MAD} = \mathrm{median}(|r_i|). \qquad (8)$$

From these, the robust weights $w_i^{robust}$ are computed:

$$w_i^{robust} = \begin{cases} \left( 1 - \left( \dfrac{r_i}{6\,\mathrm{MAD}} \right)^2 \right)^2 & |r_i| < 6\,\mathrm{MAD} \\ 0 & |r_i| \ge 6\,\mathrm{MAD} \end{cases} \qquad (9)$$

The regression is performed again for the span (from equation (5)) using newly
computed weights $w_i = w_i^{initial} \cdot w_i^{robust}$ to obtain a new curve fit, a new set of
points $\hat{y}_i$, and new residuals $r_i$. This procedure (computing robust weights and fitting
the polynomial) is repeated until the curve fit converges to a desired precision or for a
predetermined number of iterations, e.g., about 5. Upon convergence, the y value of
the point being smoothed, $x$, is replaced with the curve fit value. Only that point is
replaced; all other points in the span remain the same. The span is then shifted one
point to the right and the entire procedure repeated to smooth the point in the center of
the span. Each time the curve fit is performed, the $y_i$ values used are the raw data
values, not the smoothed ones. End points are treated as is commonly done in
smoothing.
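The robust iteration can be sketched as follows. This is a minimal illustration
rather than the prepackaged routine used in the Examples: the function name
robust_loess is hypothetical, the span is assumed to be a fraction of the number of
points, the x values are assumed distinct, a fixed iteration count stands in for a
convergence test, and end points are handled simply by using the nearest
neighbours.

```python
import numpy as np

def robust_loess(x, y, span=0.3, degree=2, n_iter=5):
    """Robust LOESS sketch; n_iter=0 reduces to standard LOESS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(degree + 2, int(np.ceil(span * n))))  # points per span
    smoothed = np.empty(n)
    for i in range(n):
        idx = np.argsort(np.abs(x - x[i]))[:k]
        xs, ys = x[idx], y[idx]                   # raw values, never smoothed ones
        d = np.abs(xs - x[i])
        w_init = (1.0 - (d / d.max()) ** 3) ** 3   # tricube initial weights
        coeffs = np.polyfit(xs, ys, degree, w=np.sqrt(w_init))
        for _ in range(n_iter):
            r = ys - np.polyval(coeffs, xs)        # residuals of the current fit
            mad = np.median(np.abs(r))             # median absolute deviation
            if mad == 0.0:
                break                              # fit is already exact
            u = r / (6.0 * mad)
            # Bisquare robust weights: zero beyond six MADs.
            w_robust = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
            coeffs = np.polyfit(xs, ys, degree, w=np.sqrt(w_init * w_robust))
        smoothed[i] = np.polyval(coeffs, x[i])     # only this point is replaced
    return smoothed

x = np.linspace(0.0, 10.0, 41)
truth = 0.1 * x ** 2 - x + 3.0
y = truth.copy()
y[20] += 50.0                                      # one gross outlier
plain = robust_loess(x, y, n_iter=0)
robust = robust_loess(x, y, n_iter=5)
```

In this synthetic case the outlier pulls the standard fit upward at its own
position, and the robust iterations reduce that pull by downweighting the point
with the largest residual.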

[0053] After all $\hat{y}_i$ values are obtained, a mapping from $t_{ref}$ to $t_{test}$ is determined,
and values for intermediate points are computed by interpolation. The retention time
values of mapped test points are then adjusted to align the complete data sets. The
process is repeated for all test data sets. Note that if the goal of the method is to align
corresponding peaks only, it is not necessary to find aligned time point values for the
intermediate points.
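The final mapping step can be sketched with linear interpolation. The function name
align_test_times is hypothetical, and linear interpolation via np.interp is one
choice among many; np.interp holds the end values constant outside the range of the
matched peaks, which is an assumption about end-point handling rather than part of
the disclosure.

```python
import numpy as np

def align_test_times(t_avg, delta_t_smoothed, test_times):
    """Map test-set time points onto the reference time axis."""
    t_avg = np.asarray(t_avg, dtype=float)
    delta = np.asarray(delta_t_smoothed, dtype=float)
    # Recover the discrete mapping between test and reference times.
    t_ref_knots = t_avg + delta / 2.0
    t_test_knots = t_avg - delta / 2.0
    # np.interp requires increasing sample positions.
    order = np.argsort(t_test_knots)
    return np.interp(test_times, t_test_knots[order], t_ref_knots[order])

aligned = align_test_times(
    t_avg=[10.0, 20.0, 30.0],
    delta_t_smoothed=[-2.0, 0.0, 2.0],
    test_times=[11.0, 15.5, 20.0, 29.0],
)
```

Here the three smoothed points map test times 11, 20, and 29 to reference times 9,
20, and 31, and the intermediate scan at 15.5 is placed halfway between the first
two.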

[0054] Although not limited to any particular hardware configuration, the present
invention is typically implemented in software by a system containing a computer that
obtains data sets from an analytical instrument (e.g., LC-MS instrument) or other
source. The LC-MS instrument includes a liquid chromatography instrument
connected to a mass spectrometer by an interface. The computer implementing the
invention typically contains a processor, memory, data storage medium, display, and
input device (e.g., keyboard and mouse). Methods of the invention are executed by
the processor under the direction of computer program code stored in the computer.
Using techniques well known in the computer arts, such code is tangibly embodied
within a computer program storage device accessible by the processor, e.g., within
system memory or on a computer-readable storage medium such as a hard disk or
CD-ROM. The methods may be implemented by any means known in the art. For
example, any number of computer programming languages, such as Java™, C++, or
Perl, may be used. Furthermore, various programming approaches such as procedural
or object oriented may be employed. It is to be understood that the steps described
above are highly simplified versions of the actual processing performed by the
computer, and that methods containing additional steps or rearrangement of the steps
described are within the scope of the present invention.

EXAMPLES 

[0055] The following examples are provided solely to illustrate various
embodiments of the present invention and are not intended to limit the scope of the
invention to the disclosed details.

EXAMPLE 1: Peaks aligned by dynamic time warping

[0056] Pooled human serum from blood bank samples was ultrafiltered through a
10-kDa membrane, and the resulting high-molecular weight fraction was reduced with
dithiothreitol (DTT) and carboxymethylated with iodoacetic acid/NaOH before being
digested with trypsin. Digested samples were analyzed on a binary HP 1100 series
HPLC coupled directly to a Micromass (Manchester, UK) LCT™ electrospray
ionization (ESI) time-of-flight (TOF) mass spectrometer equipped with a microspray
source. PicoFrit™ fused-silica capillary columns (5 µm BioBasic C18, 75 µm x 10
cm, New Objective, Woburn, MA) were run at a flow rate of 300 nL/min after flow
splitting. An on-line trapping cartridge (Peptide CapTrap, Michrom Bioresources,
Auburn, CA) allowed fast loading onto the capillary column. Injection volume was
20 µL. Gradient elution was achieved using 100% solvent A (0.1% formic acid in
water) to 40% solvent B (0.1% formic acid in acetonitrile) over 100 min.

[0057] Data sets were aligned by dynamic time warping (DTW) implemented in
MATLAB® (The MathWorks, Cambridge, MA) with custom code.

[0058] FIGS. 8A-8B show a small region of data sets corresponding to four
different samples, before and after alignment of the bottom three data sets (test) to the
top (reference) data set using DTW. Corresponding peaks are indicated. In all cases,
the aligned peaks are much closer (in retention time) to the reference peaks than they
were before alignment.

EXAMPLE 2: Data sets aligned by dynamic time warping and LOESS

[0059] Pooled human serum from blood bank samples was ultrafiltered through a
10-kDa membrane, and the resulting high-molecular weight fraction was reduced with
dithiothreitol (DTT) and carboxymethylated with iodoacetic acid/NaOH before being
digested with trypsin. Digested samples were analyzed on a binary HP 1100 series
HPLC coupled directly to a ThermoFinnigan (San Jose, CA) LCQ DECA™
electrospray ionization (ESI) ion-trap mass spectrometer using automatic gain control.
PicoFrit™ fused-silica capillary columns (5 µm BioBasic C18, 75 µm x 10 cm, New
Objective, Woburn, MA) were run at a flow rate of 300 nL/min after flow splitting.
An on-line trapping cartridge (Peptide CapTrap, Michrom Bioresources, Auburn, CA)
allowed fast loading onto the capillary column. Injection volume was 20 µL.
Gradient elution was achieved using 100% solvent A (0.1% formic acid in water) to
40% solvent B (0.1% formic acid in acetonitrile) over 100 min.

[0060] Spectra were aligned using both dynamic time warping (DTW) and robust
LOESS. Algorithms were implemented in MATLAB® (The MathWorks, Cambridge,
MA). Robust LOESS smoothing was performed using a prepackaged routine in the
MATLAB® Curve Fitting Toolbox. DTW was implemented with custom MATLAB®
code following the algorithms described above.

[0061] FIG. 9 is a plot of transformed data set variables $\Delta t$ vs. $t_{avg}$ showing
alignment by robust LOESS and DTW. Inverted triangles represent potentially
corresponding automatically-selected peaks, filled circles are points smoothed by
robust LOESS, and the thin solid line is the data set corrected by DTW. The DTW
points are much more densely spaced, because they are taken from the entire data set,
rather than selected peaks only. In this example, both robust LOESS and DTW
accurately track the time shift, with LOESS following the local variations more
closely.

[0062] It should be noted that the foregoing description is only illustrative of the
invention. Various alternatives and modifications can be devised by those skilled in
the art without departing from the invention. Accordingly, the present invention is
intended to embrace all such alternatives, modifications and variances which fall
within the scope of the disclosed invention.

CLAIMS

What is claimed is:

1. A computer-implemented method for time-aligning at least two
chromatography-mass spectrometry data sets, each comprising a plurality of
mass chromatograms, said method comprising:

a) computing a distance function between said data sets in dependence on at
least two mass chromatograms from each data set; and

b) aligning said data sets by minimizing said distance function to obtain
aligned data sets.

2. The method of claim 1, wherein one of said data sets is a reference data set
and one of said data sets is a test data set, and wherein said test data set is
aligned to said reference data set.

3. The method of claim 1, wherein said data sets are liquid chromatography-mass
spectrometry data sets.

4. The method of claim 1, wherein said distance function is computed in
dependence on between about 200 and about 400 mass chromatograms
from each data set.

5. The method of claim 1, further comprising selecting said at least two mass
chromatograms according to a selection criterion.

6. The method of claim 1, wherein said distance function is computed in
dependence on a chromatogram-dependent weighting factor.

7. The method of claim 6, wherein said chromatogram-dependent
weighting factor is a function of at least one of a peak number, an
intensity threshold, and a signal-to-noise ratio.

8. A plurality of chromatography-mass spectrometry data sets aligned according to
the method of claim 1.

9. A program storage device accessible by a processor, tangibly embodying a
program of instructions executable by said processor to perform method steps
for a method for time-aligning chromatography-mass spectrometry data sets,
each comprising a plurality of mass chromatograms, said method steps
comprising:

a) computing a distance function between said data sets in dependence on at
least two mass chromatograms from each data set; and

b) aligning said data sets by minimizing said distance function to obtain
aligned data sets.
10. A method for comparing at least two samples, comprising:

a) performing chromatography-mass spectrometry on each sample to obtain
at least two data sets, each comprising a plurality of mass chromatograms;

b) computing a distance function between two selected data sets in
dependence on at least two mass chromatograms from each selected data
set;

c) aligning said selected data sets by minimizing said distance function to
obtain aligned selected data sets; and

d) comparing said aligned selected data sets.

11. The method of claim 10, wherein one of said selected data sets is a
reference data set and another of said selected data sets is a test data set,
and wherein said test data set is aligned to said reference data set.

12. The method of claim 10, wherein said chromatography-mass spectrometry
is liquid chromatography-mass spectrometry.

13. The method of claim 10, further comprising aligning two additional data
sets, wherein at least one of said additional data sets differs from said
selected data sets.

14. The method of claim 10, further comprising selecting said at least two
mass chromatograms according to a selection criterion.

15. The method of claim 14, wherein said selection criterion is a user-provided
selection criterion.

16. The method of claim 14, wherein said selection criterion comprises
an intensity threshold.

17. The method of claim 14, wherein said selection criterion comprises a
number of chromatograms.

18. The method of claim 14, wherein said selection criterion comprises
an orthogonality metric.

19. The method of claim 14, wherein said selection criterion comprises a
retention time range.

20. The method of claim 10, wherein said distance function is computed in
dependence on between about 200 and about 400 mass chromatograms.

21. The method of claim 10, wherein said distance function is computed in
dependence on between about 200 and about 300 mass chromatograms.

22. The method of claim 10, wherein said distance function is computed in
dependence on about 200 mass chromatograms.

23. The method of claim 10, wherein said distance function is computed in
dependence on a weighting factor.

24. The method of claim 23, wherein said weighting factor is a
chromatogram-dependent weighting factor.

25. The method of claim 24, wherein said chromatogram-dependent
weighting factor is a function of at least one of a
peak number, an intensity threshold, and a signal-to-noise ratio.

26. The method of claim 10, further comprising identifying features that
differentiate said aligned selected data sets.

27. A plurality of samples compared according to the method of claim 10.

28. A method for identifying a biomarker differentiating two cohorts, comprising:

a) comparing at least two samples according to the method of claim 10, at
least one each of said samples representing a different one of said two
cohorts; and

b) identifying a biomarker in dependence on said comparison.

29. A biomarker identified by the method of claim 28.

30. A diagnostic method comprising detecting a biomarker identified by the method
of claim 28.

31. A computer-implemented method for time-aligning at least two two-dimensional
chromatography-mass spectrometry data sets, comprising:

a) selecting peaks in said data sets;

b) identifying potentially corresponding peaks from said selected peaks; and

c) performing a locally-weighted regression smoothing on said potentially
corresponding peaks to obtain aligned data sets.

32. The method of claim 31, wherein one of said data sets is a reference data
set and one of said data sets is a test data set, and wherein said test data set
is aligned to said reference data set.

33. The method of claim 31, wherein said data sets are liquid chromatography-mass
spectrometry data sets.

34. The method of claim 31, wherein said locally-weighted regression
smoothing is a robust locally-weighted regression smoothing.

35. The method of claim 34, wherein said robust locally-weighted
regression smoothing comprises robust LOESS.

36. The method of claim 31, wherein said peaks are selected automatically.

37. The method of claim 31, wherein said locally-weighted regression
smoothing is performed in dependence on a span.

38. A plurality of chromatography-mass spectrometry data sets aligned according to
the method of claim 31.

39. A program storage device accessible by a processor, tangibly embodying a
program of instructions executable by said processor to perform method steps
for a method for time-aligning two-dimensional chromatography-mass
spectrometry data sets, said method steps comprising:

a) selecting peaks in said data sets;

b) identifying potentially corresponding peaks from said selected peaks; and

c) performing a locally-weighted regression smoothing on said potentially
corresponding peaks to obtain aligned data sets.

40. A method for comparing at least two samples, comprising:

a) performing chromatography-mass spectrometry on each sample to obtain
at least two two-dimensional data sets;

b) selecting peaks in two selected data sets;

c) identifying potentially corresponding peaks from said selected peaks;

d) performing a locally-weighted regression smoothing on said potentially
corresponding peaks to obtain aligned selected data sets; and

e) comparing said aligned selected data sets.

41. The method of claim 40, wherein one of said selected data sets is a
reference data set and another of said selected data sets is a test data set,
and wherein said test data set is aligned to said reference data set.

42. The method of claim 40, wherein said chromatography-mass spectrometry
is liquid chromatography-mass spectrometry.

43. The method of claim 40, further comprising aligning two additional data
sets, wherein at least one of said additional data sets differs from said
selected data sets.

44. The method of claim 40, further comprising identifying features that
differentiate said aligned selected data sets.

45. A plurality of samples compared according to the method of claim 40.

46. A method for identifying a biomarker differentiating two cohorts, comprising:

a) comparing at least two samples according to the method of claim 40, at
least one each of said samples representing a different one of said two
cohorts; and

b) identifying a biomarker in dependence on said comparison.

47. A biomarker identified by the method of claim 40.